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A number of known techniques for improving cache performance in scientific 
computations involve the reordering of the iteration space. Some of these reorder- 
ings can be considered coverings of the iteration space with sets having small surface- 
to- volume ratios. Use of such sets may reduce the number of cache misses in compu- 
tations of local operators having the iteration space as their domain. First, we derive 
lower bounds on cache misses that any algorithm must suffer while computing a lo- 
cal operator on a grid. Then, we explore coverings of iteration spaces of structured 
and unstructured discretization grid operators which allow us to approach these 
lower bounds. For structured grids we introduce a covering by successive minima 
tiles based on the interference lattice of the grid. We show that the covering has 
a small surface-to- volume ratio and present a computer experiment showing actual 
reduction of the cache misses achieved by using these tiles. For planar unstructured 
grids we show existence of a covering which reduces the number of cache misses 
to the level of that of structured grids. Next, we introduce a class of multidimen- 
sional grids, called starry grids in this paper. These grids represent an abstraction 
of unstructured grids used in, for example, molecular simulations and the solution 
of partial differential equations. We show that starry grids can be covered by sets 
having a low surface-to-volume ratio and, hence have the same cache efficiency as 
structured grids. Finally, we present a triangulation of a three-dimensional cube 
that has the property that any local operator on the corresponding grid must incur 
a significantly larger number of cache misses than a similar operator on a structured 
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grid of the same size. 



1 Introduction 

A number of known techniques for improving cache performance in scientific compu- 
tations involve the reordering of the iteration spaceQ. We present two new methods 
for partitioning structured grid iteration spaces with minimum-surface cache fitting 
sets. Such partitioning^ reduce the number of cache misses to a level that is close to 
the theoretical minimum. We demonstrate this reduction by actual measurements 
of cache misses in computations of a second order stencil operator on structured 
three-dimensional grids. 

A good tiling of the iteration space for structured discretization grids in the 
presence of limited cache associativity can be constructed by using the interference 
lattice of the grid. This lattice is a set of grid indices mapped into the same word in 
the cache or, equivalently, a set of solutions of the Cache Miss Equation [Q. In [|| we 
introduced a (generally skewed) tiling of the iteration space of stencil operators on 
structured grids with parallelepipeds spanned by a reduced basis of the interference 
lattice. We showed that for lattices whose second shortest vector is relatively long, 
the tiling reduces the number of cache misses to a value close to the theoretical 
lower bound. Constructing the skewed tiling, however, is a nontrivial task, and 
involves a significant overhead in testing whether a particular point lies inside the 
tile. Tiling a three-dimensional grid, for example, requires the determination of 29 
integer parameters to construct the loop nest of depth six, and involves a significant 
branching overhead. 

We start the paper with deriving lower bounds on the number of cache misses 
that any algorithm must suffer while computing a local operator on either struc- 
tured or unstructured gridsQ. Subsequently, we introduce two new coverings of 
structured grids: Voronoi cells and rectilinear parallelepipeds built on the vectors 
of successive minima of the grid interference lattice. In lattices with a relatively 
long shortest vector the cells of both coverings have near-minimal surface-to- volume 
ratios. Hence, the number of cache misses in the computations tiled with these cells 
is close to the theoretical minimum derived in |2j . 

For the computation of local discretization operators on planar, fixed-degree 
unstructured grids we construct a near-minimum-perimeter covering by applying the 
Lipton-Tarjan planar graph separator algorithm The perimeter-to-area ratio of 
the sets of this covering is 0(1/ VS), where S is the cache size. Next, we introduce 
a class of multidimensional grids, called starry grids, and show that a d-dimensional 
starry grid can be covered by sets having a surface-to-volume ratio close to that of 
structured grids, i.e. 0(S~ x / d ). 

Lastly, we construct a fixed-degree unstructured grid that triangulates a three- 

1 In U wc laid the foundation for the study of reorderings to improve cache performance of 
local operators on structured grids. We extend those results in this paper to the construction 
of practical, near-optimal tilings of structured grids, and to the study of cache misses for local 
operators on unstructured grids. 

2 The derivation is a simplification of the proof given in p| for structured grids and extends its 
application to unstructured grids. 



dimensional cube and show that the grid can not be covered with sets having small 
surface-to-volume ratios. The last result shows that any computation of a local 
operator on such a three-dimensional grid must suffer a larger number of cache 
misses than the computation of a similar operator on a structured grid of the same 
size. Hence, general unstructured grids of dimension higher than two are provably 
less efficient in a cache utilization than structured grids. 

2 Cache and Discrete Geometry 

Local operators on grids. We consider the problem of computing array q by applying 
a local discretization operator K to array u (i.e. q — Ku) defined at the vertices 
of an undirected graph G = (V, E) [J Locality of K means that computation of 
q(x), x £ V, involves values of u(y), y € V, where y is at a (graph) distance^ at 
most r from x. This r is called the order of K and is assumed to be independent of 
G. Arrays q and u are distinct. Hence, the values of q can be computed in arbitrary 
order. 

We consider structured and unstructured grids that have an explicit or implicit 
embedding into Euclidean space. Structured grids can be defined as Cartesian 
products of line graphs, while edges of unstructured grids are defined explicitly, for 
example by an adjacency matrix. We assume that the maximum vertex degree is 
independent of the total number of vertices. A grid is called planar if its vertices 
and edges can be embedded into a plane without edge intersections. A grid is called 
a triangulation of a body B if it can be represented as a one-dimensional skeleton 
of a simplicial partition of B (see Q). 

Cache Model. We consider a single-level, virtual-address-mapped, set-associa- 
tive data cache memory, see The memory, with a total capacity of S words, 
is organized in z sets of a {associativity) lines each. Each line contains w words. 
Hence, the cache can be characterized by the parameter triplet (a,w,S), and its 
size S equals a * z * w words. A cache with parameters (S/w,w, S) is called fully 
associative, and a cache with parameters (1, us, S) is called direct-mapped. 

The cache memory is used as a temporary fast storage of words used for pro- 
cessing. A word at virtual address A is fetched into cache location (a(A), z(A), w(A)), 
where w(A) = A mod w, z(A) = (A/w) mod z, and a(A) is determined according 
to a replacement policy (usually a variation of least recently used). The replace- 
ment policy is not important within the scope of this paper since our lower bounds 
are valid for any replacement policy and our upper bounds hold even for a direct 
mapped cache. 

The number of cache misses incurred in computation of q depends on the 
order in which elements of u are stored in the main memory. We assume that for 
structured grids an element u(ii, . . . , id) is stored at address A = ii+nii2+nin2t3 + 
■ — \-n\ ■ ■ ■ n,d-iid, where n\, ■ ■ ■ , rid-i are the grid sizes. For unstructured grids we 
don't assume any particular ordering in memory of the grid points (and, hence, of 

3 We are particularly interested in graphs that can be interpreted as discretization grids, and 
will often use the words "grid" and "graph" interchangeably. 

4 The graph distance is the length of a shortest path connecting two vertices. The length of the 
path is the number of edges it contains. 



elements of u). Instead, we choose an ordering that reduces the number of cache 
misses. 

Replacement loads. A cache miss is defined as a request for a word of data 
that is not present in the cache at the time of the request. A cache load is defined 
as an explicit request for a word of data for which no explicit request has been 
made previously (a cold load), or whose residence in the cache has expired because 
of a cache load of another word of data into the exact same location in the cache 
(a replacement load). Cache load is used as a technical term for making some 
formulas and their proofs shorter. The definitions of cold and replacement loads 
are analogous to those of cold and replacement cache misses , respectively, and if 
w equals 1 they completely coincide. 

Isoperimetric Sets. In addition to inevitable cold loads, some replacement 
loads must occur, as formalized in Lemma |], because some values have to be dropped 
from cache before they are fully utilized. This leads us to a lower bound on cache 
misses, see Theorems |[ |[ This bound includes the surface-to-volume ratio of 
isoperimetric subsets of the grid, that is, sets whose boundary size is 3S and that 
have maximum volume. 

Surface-to-volume ratio. Some replacement loads may be avoided by careful 
scheduling of the computations. One technique for minimizing the number of re- 
placement loads is to cover the grid G with conflict-free sets Vi (V = Ui=i ^ 1^1 = 
S), that is, sets whose vertices are all mapped to different locations in cache. If we 
calculate q in all vertices of Vi before calculating it in vertices of Vj , j = i + 1, . . . , k, 
a replacement load can occur only at vertices having neighbors in at least two sets 
(boundary vertices). We consider only bounded-degree graphs, so if we can find 
a covering with sets V, having volumes \Vi\ close to S and a minimal number of 
boundary vertices \dVi\ (and boundary edges) i.e., bodies with minimum surface- 
to- volume ratio, then the computation of q will incur a number of replacement loads 
close to the minimum. On the other hand, the total partition boundary 53j=i \^Vi\ 
can be used to obtain a lower bound on the number of replacement loads, see sec- 
tion cf . [| . Deviating from the literal meaning of the word "isoperimetric" and 
following common practice, we call sets having the minimum number of boundary 
points for a given number of total points also isoperimetric. 



3 A lower bound on cache misses for local operators 

In this section we consider the following problem: for a given grid G and a local 
operator K, how many cache misses must be incurred in order to compute q = Ku, 
where q and u are two distinct arrays defined on the grid. We provide a lower bound 
on the number of cache misses in any algorithm, regardless of the order in which 
the grid points are visited during the computation of q. The lower bound contains 
the surface-to- volume ratio of isoperimetric sets of the grid. This ratio can be well 



estimated for structured grids, and for FFT grids, to be described in Section 5.2. 
It may be shown that our lower bound is tight for the above-mentioned grids, and 
in general for any grid which may be covered by sets of size S having an optimal 
surface-to- volume ratio. 



We use the following terminology to describe the operator K. Locality of K 
means that the value of q at grid point a; is a function of the values u(y), where y is 
a grid point at a graph distance at most r from x (r is called the radius, or order, 
of K). We call an operator of radius r symmetric if it uses values of u at all grid 
points within distance r from the target point. In this section we obtain a lower 
bound for a symmetric operator of unit radius, which obviously is a lower bound 
for symmetric operators of larger radii as well. 

3.1 Pointwise Computations 

Depending on the separability of the local operator, it has to be computed pointwise, 
or it may allow computation in edgewise fashion. If an operator is not separable, it 
requires values of u at all neighbor points simultaneously to compute a value of q. 
We call such computation pointwise. In this subsection we assume that computation 
of q on the grid G is performed in a pointwise fashion, that is, at any grid point 
the value of q is computed completely before computation of the value of q at 
another point is started. Separable operators, which can be evaluated edgewise, are 
considered in the next section. 

In order to compute the value of q at a grid point x, the values of u at the 
neighbor points of x must be loaded into the cache (points y and x are neighbors if 
they are connected by a grid edge). If a; is a neighbor of y and u(y) has been loaded 
in cache to compute q(z) but is dropped from the cache before q(x) is computed, 
then u(y) must be reloaded, resulting in a replacement load. Our lower bound on 
cache loads is based on the following simple observation. 

Lemma 1. Let values of u at a vertex set X first be used for computation of q in a 
vertex set Uq, and next for computation of q in a vertex set U±, UoHUi — 0. This 
computation incurs at least \X\ — S replacement loads. 

Proof. Since the cache can hold at most S values, at least \X\ — S values must 
have been dropped from the cache by the time computations at the points of Uq 
are done. Since all values at points of X are necessary for computing values at U\, 
they have to be reloaded into the cache. Hence, computations of values at points 
of Ui will result in at least \X\ — S replacement loads. □ 

To estimate for a given algorithm the number of elements, p, of array u that 
must be replaced, we partition V into a disjoint union of k sets Vi, with V = U* =1 V$, 
in such a way that q is computed at all points of Vi before it is computed at any 
point of Vi + i, see Figure [j]. Let B\ and Bf be (possibly overlapping) subsets of Vi 
which have neighbors in Uj^Vj and UjyiVj, respectively. The set Bi = B\ U Bf is 
the boundary of Vi. 

Replacement loads of values at the points of V, will be incurred when such 
points show up as neighbors in other vertex sets. We apply Lemma [j] to the two 
pairs of sets in which boundary points of Vi occur as members and neighbors: 



Uj<iVj and Vi 
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Figure 1. Iteration space V is partitioned in a sequence of regions Vi. 
Reloading of some values of u on the boundary of V3 (heavy and dotted lines, cor- 
responding to I-B3I and I-B3 |, respectively) results in at least \B l 3 \ + \B%\ — 2S cache 
loads. 

Vi and Uj>i Vj 
This gives us the following bounds on replacement loads: 

A>\B\\-S 

and 

px>m-s, 

respectively. Hence, in each set of the partition at least pi values have to be reloaded 
(see Figure |]), with 

Pi =p\+P?> \Bl\ + \B?\ -2S> \B t \ - 2S. 

The total number of reloaded values in the course of computing q on the entire 
grid will be at least p — Yli=i Pii ^ or which we have the following bound: 



k k 

p = Y.p^H\ B ^~ 2kS - 



Let v be the maximum number of points in a set with 35 points on its bound- 
ary, i.e. let v be the number of points in an isoperimetric set with 3S boundary 
points. Choosing a partition V — U* =1 Vi in such a way that \Vl\ — v, we have 
\Bi\ > 35, hence 



P >S^. ,1) 

v 



Thus we have the following result. 



Theorem 2. The number of cache misses fi in the evaluation of a symmetric 
operator on a grid G = (V, E) is bounded as follows: 

tx>h\V\+p)>-\V\(l + \a(G)) 
w w 3 

where ct(G) is the surface-to-volume ratio of isoperimetric subsets of V with 3S 
boundary points. 

Proof. We sum the number of replacements in ([l]) with the number of cold loads 
\V\ and notice that § = l/3a(G). Then we notice that any cache miss results in 
a load of w words into the cache, and that in the best possible case all w words 
contain useful data. □ 



3.2 Edgewise Computations 

The pointwise calculation model used in the previous subsection is too restrictive 
in many cases. Using separability of the operator, the number of cache misses may 
be reduced. In this section we present lower bounds for more general computations 
where values of q may be updated multiple times in arbitrary order. We call these 
computations edgewise. 

An edgewise computation is performed at the vertices of a bipartite graph 
H = (V 1 , V , E) where, V 1 and V are vertex sets, both equal to V, and (x, y), x £ 
V 1 , y G V° , is an edge in H iff x and y are neighbors in G or x = y. The values of 
u are given in the nodes of V 1 , and values of q have to be computed at the nodes 
of V ° . An atomic step in an edgewise computation is the evaluation of a function 
of two variables corresponding to an edge of H . If a value at an end point of the 
edge is currently not loaded, then the computation suffers a cache load. All edges 
of the grid must be visited. We want to estimate the number of cache loads that 
each computation of q = Ku must suffer. 

The arguments in the proof of the lower bound of Theorem |^ can be modified 
by partitioning the edges of H into disjoint sets Ei, with E — \J^_ x Ei, in such a 
way that processing of any edge in Ei precedes processing of any edge in Bj+i, and 
such that the size of boundary of Ei is at least 3S. Here the boundary of an edge 
set Ei is the set of vertices shared by an edge in Ei and an edge not in Ei. The 
surface-to- volume ratio of an edge set is the ratio of its number of boundary vertices 
to the number of its edges. We define f3{G) to be the minimum surface-to-volume 
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ratio of the edge sets in H (derived from G) having a boundary of size 35. In other 
words /3(G) is the number of boundary points divided by the number of edges in 
isoperimetric sets in H having 35 boundary points. The following result can be 
proved using arguments completely analogous to those in Theorem |^. 

Theorem 3. The number of cache misses in a calculation of a separable symmetric 
operator on a grid G = (V, E) is bounded as follows: 

^>-\E\(i+y(G)) 

w 6 

where (3(G) is the minimum of the surface-to-volume ratio over subsets of E with 
35 boundary points. 



4 Structured Grids 
4.1 Interference Lattice 

Interference lattice. Let u be a <i-dimensional array defined at the vertices of a 
structured d-dimensional grid of size n\ x • • • x n d . Let L be a set in the index 
space of u having the same image in cache as the index (0, . . . , 0). L is a lattice in 
the sense that there is a generating set of vectors {hi}, i = 1, . . . , d, such that L 
is the set of grid points {X^=i x J°i I x i € Z}. We call L the interference lattice of 
u. For a direct- mapped cache it can be defined as the set of all vectors (ii, . . . , id) 
that satisfy the Cache Miss Equation |t]]: 

(ii + 7iii 2 + nin 2 i3, + ■ • • + «i • ■ • n d -iid) mod 5 = 0. (2) 

We will use some geometrical properties of lattices. Let B be a convex body 
of volume V, symmetrical about the origin. The minimal positive Xi such that XiB 
contains i linearly independent vectors of L is called the i th successive minimum 
vector of a lattice L relative to B. A theorem by Minkowski, see (Ch. VIII, Th. 
V), asserts that 

ii < nt^ < ^. (3) 

d\V ~ detL - V w 

Note that the ratios of lattice successive minima relative to the unit cube and to 
the unit ball can be bounded: l/d < Xf 1 /\\ < d. We use successive minima 



relative to the unit ball and the unit cube in the Sections 4.2 and 4.3 respectively. 
In both cases we call / = Xd/X\ the eccentricity of the lattice (not to be confused 
with eccentricity of a reduced basis, defined in [Ej, Section 4). The eccentricities 
relative to a ball and a cube may differ by a factor of d 2 at most. 

In Q we introduced a tiling by parallelepipeds built on a reduced basis of 
the interference lattice which decreases the number of the cache misses to a level 
close to the theoretical lower bound that we also derived. Measurement shows that 
this tiling incurs significantly fewer cache misses than a compiler-optimized code 
with canonical loop ordering. However, it has a high computational cost, since it 



depends on a significant number of integer parameters, and its implementation scans 
a substantial number of the grid points to select those suitable for cache conflict- 
free computations. This prompts us to consider alternative tilings, specifically with 
Voronoi cells and with successive minima parallelepipeds. We show that these tilings 
have good surface-to- volume ratios if the lattice eccentricity is small. 

4.2 Voronoi Tiling 

A Voronoi tiling is a tiling of the grid by completed cells C (Voronoi tile) of the 
Voronoi diagram of the interference lattice. For each lattice point x a Voronoi 
cell is the set of points which are closer to x than to any other lattice point. All 
integer points inside each Voronoi cell are mapped into the cache without conflicts. 
Voronoi cells may not form a tiling since some integer points can be located on a 
cell boundary. There are many qualitatively equivalent ways to complete the cells 
to form a tiling. One way is to choose a basis in the space of the lattice and assign 
an integer point to the cell whose center is lexicographically closest to the point. 

In order to estimate the surface-to-volume ratio of a Voronoi cell C we note 
that the completed Voronoi cells form a tiling of space. Hence, the volume of C 
equals the determinant of the lattice (defined by (Q)), which is equal to S, see Eqj. 
On the other hand, let C a be a Voronoi cell centered at o. Each vertex v of C is 
equidistant from d lattice points. Let r be that distance, which will in general be 
different for different v. According to the definition of the Voronoi cell, the ball of 
radius r, centered at v, contains no other lattice points. Hence r < R, where R 
is the radius of a maximal ball of the lattice (a lattice-points-free ball of maximal 
radius). Hence, C is contained in a ball of radius R, centered at o. Thus, the 
surface area of C is bounded by the surface of a ball of radius R, which equals 
dVd where Vd — r(T+rf/2) ^ s * nc vommc °f the unit <i-dimensional ball (see |l| 



We estimate the radius of the maximal ball R by induction on the dimension 
of a sublattice. Let Ri be the radius of the maximal ball inscribed into the lattice 
Li generated by the first i minima vectors Vj of L relative to the i-dimensional ball. 



where hi is the distance between hyperplanes spanned by L,_i and + v;, 

respectively, and Xi is the i th successive minimum. Induction on i gives the following 
result. 

Lemma 4. The following relation holds for the radius of a lattice-points-free ball 
in a d- dimensional lattice L: 



Ch. IX.7). 



Then we have (see Figure ||) : 



i??<(^/2) 2 + i?t 1 <(A,/2) 2 + J R 1 2 



'%— 1 5 



d 




(4) 



where Ai < • • • < Ad are successive minima of L. 
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Figure 2. Let Li be a lattice generated by the first i minima vectors of L. 
An intersection of a maximal ball of Li with the subspace of Li-\ is contained in a 
maximal ball of Hence R\ < (hi/2) 2 + Rf_ 1 where hi is the distance between 

subspaces Li^i and + Li-\. 

Since C is contained in a ball of radius R and both C and ball are convex 
bodies, the surface area A of C does not exceed the surface area of the ball and we 
have an estimation 

A(C) < dVdR' 1 - 1 < d(Vd/2) d - 1 V d Xd d - 1 < d^V^/^- 1 ) 2 /^-!)/^ 

where we used the estimation A^ < ■ 2 ^f d ~ 1 S derived from (||), and the bound 
R < ^frXd which follows from (||). This implies the following result. 

Theorem 5. The surface-to-volume ratio of a lattice Voronoi cell C can be esti- 
mated as: 



A{C) 



< c d f^ 2 / d S-^ d 



V(C) 

where a is a constant depending on d only, and f is the lattice eccentricity. 

4.3 Successive Minima Tiling 

The Voronoi cell tiling has cache-conflict-free tiles of maximum possible volume 
S 1 , and of small surface-to-volume ratio. However, the tiles may have many faces, 
and it may be computationally expensive to traverse the grid points inside a tile. 
It is more desirable to use rectilinear tiles. A successive minima tiling is a tiling 
by Cartesian blocks constructed using successive minima lattice vectors of the unit 
cube. Such a block Q can be described by the system 



\xi\ <bi, i=l,...,d, (5) 

where Ai < bi < Xd- 

The block Q can be constructed by the following "inflating" process. Take an 
initial cube of the form rtq), with bi = 1, i = 1, . . . , d, and increment all bi equally 
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until the face Xi t = bi 1 contains a lattice point. Continue to increment values of all 
bj for which the face Xj = bj has no lattice points. At the end we obtain a block of 
the form (||), containing a lattice point on each of its faces and containing no lattice 
points inside except o. If each successive minimum vector has end points in distinct 
faces of the block, then 6, = A.; (after an appropriate reordering of the coordinates). 
On the other hand, it is not difficult to construct a three-dimensional lattice such 
that the block satisfies b\ = Ai, 62 = ^3 = A2 < A3, in which case the volume of the 
block would be strictly less than Ai • • • A^. 

Any translation of the block Q' = \Q obviously contains at most one lattice 
point and can be used for conflict free tiling. This block has a low surface-to- volume 
ratio if the lattice has bounded eccentricity, which can be seen from the following 
inequalities: A(Q') < 2d\ d ~ 1 and V{Q') > A?. Since Ai = *f > j( 1 ^) 1/d , as 
follows from (^), we obtain the following result. 

Theorem 6. The surface-to-volume ratio of the block Q' obtained in a successive 
minima procedure can be estimated as follows: 

< 2df d - 1 /\ 1 < d{dW d yl d f d S-^ d , 

where Vd is a constant depending on d only, and f is the lattice eccentricity. 
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Figure 3. Comparison of measured cache misses for the second order 
stencil operator as a function of the first grid dimension (AO < nx < 99, ny — 
97, nz — 99). The top curve shows the number of cache misses for the compiler 
optimized canonically ordered nest. The bottom curve is obtained for tilings with 
successive-minima parallelepipeds. 

As a representative example, the number of measured primary cache misses 
for tilings of three-dimensional grids of sizes 40 < nx < 99, ny — 97, nz = 99 with 
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successive minima parallelepipeds is shown in Figure [5| The computation evaluates 
a linear operator on the standard 13-point star stencil. Experiments were performed 
on an SGI Origin 2000 machine with a MIPS R10000 processor. 

It is easy to show that the lattice eccentricities / can be bounded by the 
eccentricity of a reduced basis as defined in ||, times a constant depending on d 
only. Hence, it follows from Theorems ^| and |^ that the Voronoi and successive 
minima tilings have the same asymptotic surface-to-volume ratio — and therefore 
the same cache efficiency — as the tiling with fundamental parallelepipeds derived 
in §. 

5 Unstructured Grids 
5.1 Lipton-Tarjan Covering 

In this section we present a covering of a planar bounded-degree grid with vertex sets 
of size at most S that have an average surface-to- volume ratio equal to 0(l/V~S)- 
The covering can be used for computing a first-order operator on a | V|-vertex planar 
grid with 0(\V \/V~S) replacement loads. The covering is based on the Lipton-Tarjan 
separator theorem, which asserts that any planar graph of \V\ vertices has a vertex 
separator of size C(-\/|V F |). The separator can be constructed in C(|V|) time ||. 

We consider bounded-degree unstructured grids, that is, grids having fixed 
maximum vertex degree d, independent of the grid size. (Calculation on unbounded- 
vertex-degree grids may cause approximation problems and numerical instability. 
Consequently, most numerical methods employ bounded-degree grids.) For bounded 
degree grids a node cut of size C(|y|) has a corresponding edge cut of size 
and vice versa. Hence, the Lipton-Tarjan separator theorem (see |8j, section 2, 
Corollary 2) can be reformulated as follows: By removing V^|) edges from a 

|V^|-vertex bounded-degree planar graph it can be separated into connected disjoint 
subgraphs, each having at most 2|V|/3 vertices. 

We construct the covering by applying the Lipton-Tarjan theorem recursively. 
First, we choose an edge cut C of the original graph G = (V,E), such that \C\ < 
coy/\V\, where Co is independent of |V|. According to the Lipton-Tarjan separator 
theorem, the cut can be chosen in such a way that it will split the graph into 
connected components Gi = (Vi,Ei), i — l,...,k, \Vi\ < 21^1/3. Adding an 
extra step in this partition, we can assume that \Vi\ < \V\/2 while |C| < c ly /~\V\ 
for a bigger constant c\. Then, we recursively bisect each connected component 
Gi = (Vi,Ei) while \Vt \ > S. We call this covering a Lipton-Tarjan covering. 

This partition process can be represented by a cut tree T whose nodes are 
partitioned connected components of the grid. The tree will be used to estimate 
the total number of removed edges to construct a Lipton-Tarjan covering. Let 
a connected component C, represented by node t of T, be split into components 
Ci, . . . , C m by a Lipton-Tarjan cut. We draw an edge between t and each of its 
child nodes representing Ci, . . . , C m . However, we do not include in T the connected 
components of size smaller than S. Hence, the size of each leaf (i.e. a node having 
no children) exceeds S. Connected components smaller than S do not contain any 
cut edges and they are not needed for the estimate. To each node t of T we assign 
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size s(t), which equals the number of vertices in the set represented by t, and weight 
w(t) = y/s(€j. 

Lemma 7. The total number of edges in all cuts of a Lipton-Tarjan covering is 

o(\v\/Vs). 

Proof. The total number of edges in all cuts made to construct a Lipton-Tarjan 
covering can be bounded by 0(o~(T)), where 

a(T)= 

t node of T 

We use two properties of the weights. First, 

E ™(1)<\V\/VS, (7) 

l leaf of T 

since the maximum of the \J s(l), subject to J2s(l) = \V\ and s(l) > S + 1, is 
attained for s(l) — S + 1 for all I. Second, for each node t of T we have 

w(t) <1/V2 £ to(r), (8) 

r child of t 

which follows from Proposition [l^ with d — 2, see Appendix. 

Now a(T) can be estimated in two steps. First, we replace weights in each 
nonleaf t by the right hand side of (JsJ) , going bottom up from the leaves to the 
root of T. This operation will not decrease the total weight. Second, we carry 
summation of new weights across nodes of T by noticing that each leaf I deposits 
into <t(T) a weight of at most 

w(l)(l + ^= + -p + ■ ■ •) < w{l)(2 + V2). 

Hence, it follows from ^ that 

a(T)<(2 + V2)\V\/^S, (9) 
so the total number of edges in all cuts is 0(\V\/VS). □ 

Now we prove the main result of this section. From a lower bound on the 
number of replacement loads for computations of star stencil operators on structured 
two-dimensional grids, see Section 0, cf . Q , it follows that the order of this bound 
can not be improved. 

Theorem 8. Any first-order operator K on a bounded- degree planar grid G — 
{V,E) can be computed with 0(\V\/\fS) replacement loads. 



Proof. First, we reorder vertices of the grid in such a way that vertices of each 
set of the Lipton-Tarjan covering will occupy contiguous memory locations. Then 
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we compute K on each set, one at a time. In this computation replacement loads 
can happen only at the vertices of the cuts of the grid. According to Lemma ^ the 
number of such vertices does not exceed 0(\V\/>/S). □ 



5.2 Covering of Starry Grids 

In the previous sections we showed that structured and planar grids can be covered 
by sets having low surface-to-volume ratios. That implies that local operators of 
fixed order on such grids can be evaluated with high efficiency of cache utilization. 
In this section we introduce a class of multi-dimensional unstructured grids, which 
we call starry grids. We show that cache efficiency of starry grids is the same as 
that of structured grids. 

Definition 9. We call a grid a d- dimensional starry grid if it can be mapped to a 
grid in d- dimensional Euclidean space that has the following two properties: 

• The ratio of the length L of the longest edge of the grid to the length I of the 
shortest edge of the grid is limited by a constant independent of the number of 
grid points: 

L<col. (10) 

• There are no grid points at a distance shorter than I . 

A vertex of a starry grid can be adjacent only to grid points contained in a 
ball of radius L centered at this point. On the other hand, a ball of radius I around 
any vertex of a starry grid is free of other grid points. Hence, the vertex degree of a 
e?-dimensional starry grid is bounded by a constant depending on d only. Another 
property of starry grids is that any subgrid is starry as well. Any grid with |V| 
vertices is a (|V| — l)-dimensional starry grid, since it can be mapped onto the 
vertices of a unit (|V| — l)-dimensional simplex. In this paper, however, we consider 
d to be fixed, and |V| to be a parameter. 

Starry grids are common in many computations, such as those involving par- 
ticles distributed in a finite domain and interacting via a short range potential. 
Unstructured discretization grids for solving partial differential equations are often 
starry as well. While there is no obvious way to verify that an arbitrary grid is a 
e?-dimcnsional starry grid, starriness of a grid embedded into Euclidean space can 
easily be verified. One simple way to construct a starry grid is to delete some 
vertices and their incident edges from a structured grid. 

We will show that a d-dimensional starry grid G = (V, E) can be covered 
by sets of size S having at most c|V A |S'~3 boundary points in total, see Theorem 
|l2| . This covering, as in the case of planar grids, is based on a cutting theorem, 
specifically, the Hyperplane Cut Theorem (Theorem |ll|) , asserting that there is a 
hyperplane bisecting V into almost equal sized parts while cutting at most c| 3~ 
edges of the grid. From this we deduce the main result of the section. 
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Theorem 10. Any first-order operator K on a starry d- dimensional grid G = 
(V, E) can be computed with 0(\V\S~^) replacement loads. 

Proof. First, we reorder vertices of the grid in such a way that the vertices of 
each set of the covering occupy contiguous memory locations. Then we compute K 
within each set separately. In this computation replacement loads can happen only 



at the vertices of the cuts of the grid. According to Theorem 12 the number of such 
vertices does not exceed 0(\V\S~d). 

From a lower bound on the number of replacement loads for computations 
of operators on structured d-dimensional grids, see Section it follows that this 
bound can not be improved. 

It is not difficult to see that any continuous convex body has a small bisector 
(for example, a hyperplane orthogonal to a diameter of the body). It is not sur- 
prising, then, that if a grid represents the body well, it has a small bisector size 
as well. We will construct a set containing 3d vectors, and show that the bisecting 
hyperplane can be chosen to be normal to one of these vectors. This constitutes an 
algorithm of complexity C(|l^| 2 ) for finding such a bisector. 

The following is a multi-dimensional analog of the Lipton-Tarjan cut theorem. 

Theorem 11. (Hyperplane Cut Theorem) For any starry d-dimensional grid G = 
(V,E) there is a hyperplane cut of size 0(\V\^~) separating vertices of the grid 
into two sets of size at least \V\/3. 

Proof. Let us consider vectors of unit length u; G lZ d , = 1, i = I, . . . ,k such 
that all vertices have different orthogonal projections onto each u^. Then we choose 
slabs Si bounded by hyperplanes orthogonal to Ui : Si = {x e lZ d \\i < u,x < /^}, 
trisecting V, see Figure ||| that is 

|{v G l>. iV < Xi}\ > \V\/3 while |{v G y|u 4 v < AJ| < \V\/3, (11) 



|{v G y|u 4 v > m}\ > \V\/3 while |{v G V\u^ > m}\ < \V\/3. 

meaning that each slab contains at least \V\/3 points, while at most \V\/3 grid 
points can be contained strictly inside a slab. We will show that Ui, . . . , Ujt can be 
chosen in such a way that at least one slab is wide, that is hi = fii — \i > OdV^^l). 
If \V\ is big enough, then it follows from |l(]) that no grid edge can pierce this slab, 
hence there exists a hyperplane Hi = {x : u^x = r/i}, Xi < rji < [ii that intersects 
at most 0(| V| "S - ) grid edges. This hyperplane separates grid points into sets 
containing at least \V\/3 points. 

Let V 1 be the set of grid points contained in exactly i slabs, and let Vf be 
the subset of V 1 contained in Sj. Since V 1 ,i = l,...,k, are disjoint sets and 
^i<kV l C V, we have 



]Tin<m 



(12) 
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Since each slab contains at least |V|/3 grid points, it follows that 

£|v7l>|vq/3. 

j 

If we sum all points in all slabs, then each point in V % will be counted exactly i 
times, hence 

E ^i>tm 

l<z<fc 

and 

E^i ^ - E >(^-d+w\: (13) 

where the last inequality follows from ([l2|). Let Pj, / = . . . , id}, 1 < «i < 
• • • < id < be a parallelepiped formed by the intersection of d different slabs. 
Obviously, Ui>dV l C U/Pj. Let [// be the number of grid points in Pj, then 

E^E (iVi> ia*E*ih- 

If we now choose k = 3d, it follows from (|l^) that 

E^/>~m (14) 
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According to the second property of starry grids, the balls Bi of radius I, centered 
at the grid points, do not intersect, so 

where c^,l d = vo\(Bi). 

Now we exercise our freedom to select the vectors u^. We choose them to be 
the rows (normalized to unit length) of a 3d x d generalized Vandermonde matrix 

W = \(i~a) j \, i = l,...,3d, j = 0, . . . , d - 1, < a < 1 

and choose a in such a way that Uj(v p — v q ) =/= 0, where Vp,v 9 are different grid 
points. These gives us at most 3d\V | 2 nontrivial constraints. We choose a such that 
inequality holds for all of them. Then each parallelepiped Pj may be described by 
a system of inequalities 

-hi < D 7 W}(x - x/) < hi , 

where Wi is a square matrix, consisting of rows (ii, . . . , i<$) of W, and hi = 
(hi 1 , . . . , hi d ) 1 , hi s is the halfwidth of Pi in the direction of the vector Ui s , x/ 
is the center point of Pi and Di is the normalizing matrix. Hence, Pi + Bi is 
contained in a ball of radius with 

Ri = \DJ 1 \\Wf 1 \\hi + l\<{3d) d Vd 11 (i-j)(h + l) < 3 2d d 2d+1 ^(h + l), 

where h — maxi<i<3c;{/ii} and all matrix norms are those induced by the Euclidean 
distance. If h < Z, then each Pi contains at most one grid point, which contradicts 
( |l2| ) for |V| large enough, hence h > I, and we have 

vol(P/ + B{) <(h + l) d vo\{Pi) < c 4 h d . 

From this inequality, together with (^) and (|l5|), it follows that 

c s „ ls (i) J . 

This results in the desired lower bound for the width of a slab containing at least 
\V\/3 grid points 

h > c 6 \V\ 1/d l. 

Now we show that there is a hyperplane parallel to the boundaries of the 
slab which intersects at most C6|^|^~ edges of the grid. We divide the slab into 
> ce-^l^l 1 ^ > C7|F| 1 / d parallel slices of width L. The number of these slices is 
bigger than 1 if \V\ is sufficiently large. Since the total number of grid points in the 
slab does not exceed |V|, at least one slice contains fewer than cslV] -3- grid points. 
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If a grid edge intersects bisector H of the slice, then at least one end of the edge is 
inside the slice (since the slice has thickness L and the length of no edge exceeds 
L). Hence, the total number of the edges intersecting H does not exceed eg ^ <5, 
where S is the degree of the grid, which is bounded for starry grids. Since H is 
inside the slab, it separates the grid points into sets containing at least \V\/Z points 
each. □ 



Now we can formulate our covering result, which implies that computations of 
local operators of fixed order on starry grids can be performed with the same cache 
efficiency as those on structured grids. 

Theorem 12. Nodes of d- dimensional starry grid G = (V, E) can be covered by 
sets of size not exceeding 5 and with total boundary 0{\V\S~' 3 ). 



Proof. The proof closely follows that of the main result of Section 5.1. As was 
stated in the beginning of the section, any subgrid of a starry grid is starry of the 
same dimension. Hence, we can construct the covering by applying the Hyperplane 
Cut Theorem recursively. First, we choose any bisector cutting at most Co | | ~ a_ 
edges of the grid G, where Cq is independent of G. According to the Hyperplane 
Cut Theorem, the bisector can be chosen in such a way that it splits the grid into 
connected components d = (Vi,Ei), i = 1, . . . , k, \Vi\ < 2\V\/3. Adding an extra 
step in this partition, we can assume that \Vi\ < \V\/2, while the number of edges 
cut by the bisector does not exceed c\ \ V \ for a bigger constant c\ . We recursively 
bisect each connected component Gi = (Vi,Ei) while \Vi\ > S. 

This partition process can be represented by a cut tree T whose nodes are 
partitioned connected components of the grid. Let a connected component C, rep- 
resented by node t of T, be split into components C%, . . . , C m by a hyperplane cut. 
We draw an edge between t and each of its child nodes representing C\ , . . . , C m . 
However, we do not include in T connected components of size smaller than 5. To 
each node t of T we assign size s(t), which equals the number of vertices in the set 
represented by t, and weight w(t) — s(t)~z~ . From the definition of the cut tree it 
follows that the size of each leaf exceeds 5. The total number of edges in all cuts 
can be bounded by 0(a(T)), where 

a(T)= W W ( 16 ) 

t node of T 

We use two properties of the weights. First, 

]T w(l)<\V\S~i, (17) 



l leaf of T 

d-l 



since the maximum of ^ s(l)~*~ for sizes of the nodes, subject to s(l) = \ V\ and 
s(l) > 5+1, is attained at s(l) = 5+1 for all Second, as follows from Proposition 
|l6| (see Appendix), for any node t of T we have 

w(t) < c ^2 W ( T ) ( 18 ) 

r child of t 
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for some c < 1 independent of the grid size. 

Now er(T) can be estimated in two steps. First, we replace weights in each 
nonleaf t by the right hand side of (|l8|), going bottom up from the leaves to the 
root of T. This operation will not decrease the total weight. Second, we carry 
the summation of the new weights across nodes of T by noticing that each leaf I 
deposits into the total sum a(T) a weight of at most 

w(l)(l + c + c 2 + ...) =w(l)c 5 . 



Hence, it follows from (17) that 

<r(T)<c 5 \V\S-^ , (19) 
meaning that the total number of edges in all cuts is ©(IVIS 1- ^). p 



5.3 Cache Unfriendly three-dimensional Grid 

In the previous section we provided a characterization of cache-friendly higher- 
dimensional unstructured grids, called starry grids. These are not a trivial exten- 
sion of planar, bounded degree grids to higher dimensions. To demonstrate this, 
we present a three-dimensional grid of bounded degree that is intrinsically cache- 
unfriendly More formally we construct a three-dimensional grid of N vertices 
which has a subgrid G of size cN that does not contain small subsets with a small 
surface-to-volume ratio. Using this property, following the arguments of Section ^, 
it can be shown that for any computation of a symmetric operator defined on this 
grid, £l(N/\ogS) replacement misses must occur. 

Our construction is based on embedding an FFT butterfly graph into a tri- 
angulation of a three-dimensional cube. The 2™-point FFT graph, denoted by F n , 
has (n + 1)2™ = N vertices, arranged in n + 1 layers of 2™ vertices each, see Figure 
||. In other words, vertices of F n form an array (k,i), < k < n, < i < 2™ — 1, 




Figure 5. Recursive construction of FFT graph. F n is built from two 
copies of F n -\ by adding an (n+ \) th layer of 2 n vertices and connecting them with 
vertices of the n th layer using butterflies. 
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and a vertex (k, i), k < n is connected with vertices (k + and [k + l,i@2' £ ), 
where i0 2 k signifies taking the complement of the k th bit of i. 

Theorem 13. The number of cache misses /i in the evaluation of a symmetric, 
first-order operator on F n is bounded as follows: 

/i > -iV(l + -^-) , 
w log 5 

where c is a constant and N — (n + 1)2™. 

Proof. The theorem is a direct sequence of Theorem || of Section [| and of an 
estimation of the size of the boundary of vertex coverings of F n , given in Proposition 

IE □ 



Proposition 14. Let F n = [JVi, i = l,...,k, \Vi\ < S be any partition, and 
S < 2™/ 24 . Then the following inequality holds for the sum of the numbers of 
boundary vertices of the sets of the partition: 

Eiww^iEis- (20) 



Proof. For any subset V C F n it follows from (|2l|) that S(V) > ±\V\/log\V\, and 
we can estimate the sum of boundaries of the partition as follows: 



£ \d( Vi )\ > ^ W) - 6 • 2" > - X; • 2™ 

1 fe N 

> y^lVJI -6-2" > , 

- 2 log 5 1 " 41ogS*' 

where the boundary operator 6 is as defined in Lemma fsl The last inequality holds 
since S < 2"/ 24 . □ 



Lemma 15. Let V be any nonempty node subset of grid F n , n > 1, and let E be 
the set of all edges of F n incident on at least one node of V . Then we have 

\V\< 28{V) log S(V), (21) 

where 5(V) is sum of the number of boundary nodes in V and of the number of 
boundary edges of E. A node ofV is called a boundary node if it is in layer or n 
of F n , and an edge of E is called a boundary edge if it connects a vertex in V with 
a vertex not in V. 
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Proof. Our proof of inequality (j2l|) is based on induction and is similar to the proof 
of Theorem 4.1 in 0]. The base of induction for n = 1 is obviously true. Next, let 
V be partitioned into three sets A, B and C, as shown in Figure |[ Sets A and B 
contain the nodes of V that are in the two separate F„_i subgraphs of F n . Sets Aq 
and i?o are contained in the last layer of these F n -i. Set C contains the n th layer 
of nodes of V in F n . The subset of nodes of F n that constitutes V is indicated by 
open circles. 




Figure 6. Induction step for proving the surface-to-volume inequality of 
a subset in F n . We can assume without loss of generality that \Aq\ > \Bq\. 

Further, we define the partition C = C 2 U C 1 U C°, where C % is the subset of 
nodes in C that are adjacent to exactly i nodes in A U B. Now the equality 

\v\ = iai + isi + ic^ + i^i + ic ! 

follows from the fact that AU BUC is a partition of V, with A and B contained in 
•fn— l- 

Now consider those edges of the last layer of F n that are incident on a vertex of 
V. These edges can be partitioned into 5 groups: those incident on nodes in A and 
C, B and C, A only, B only, and C only. We denote these sets by Eac, Ebc,Ea,Eb 
and Ec, respectively. Since two such edges are incident on each vertex in A, B and 
C, we have the following two relations: 

\E A \ + \E AC \ + \E B \ + \E BC \ - 2(14,1 + \Bo\) 

\E C \ + \E AC \ + \E BC \ = 2\C\ 

From these equalities it follows that 

\E A \ + E B \ + \E C \ = 2(141 + |Bo| - |C|) + 2\E C \. 

We determine the size of the boundary of V by adding to A U B the nodes 
contained in the last graph layer of F n , i.e. those in C. This causes some boundary 
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nodes of A and B to cease to be boundary nodes, namely those in A and B , while 
all nodes in C are now new boundary nodes of V. Subsequently, we determine 
the effect of adding C on the number of boundary edges. It is easy to see that all 
edges in V that were boundary edges of A or B are also boundary edges of V. The 
boundary edges that are newly created by adding C are those that connect a node 
of Aq or Bq to a node not in C, as well as those that connect a node of C to a node 
not in Ao or Bq. These new edge sets are exactly the following: Ea, Eb, and Ec, 
which are all disjoint. Consequently, we find: 

5(V) = 6(A) + 5(B) - \A \ -\B \ + \C\ + \E A \ + \E B \ + \E C \. (22) 

Taking into account that \C 2 \ < 2\B Q \ and that \E C \ = IC 1 1 + 2|C°|, from the 
previous equations we derive 

5(V) > 5(A) + 5(B) + \Ao\ - \B \ + IC 1 ! + 3|<7°| . (23) 

Using induction, and assuming that 5(B) is nonzero, we obtain 

\V\ < 2(6(A) log 5(A) + 5(B) \og6(B) + \B \) = 2(5(V) \og5(V) - X) 

where 

X > (5(A) + 5(B) + D+\C 1 \+ 3|C°|) \og(5(A) + 5(B) + D+\C 1 \+ 3|C°|) 
- 5(A) log5(A) - 5(B)log5(B) -\B \, 

and D = \A \ - \B \. Because D > 0, \C a \ +3|C°| > 0, and 5(A), 6(B) > 1, we get 

X > 5(A) log(l + |||) + 5(B) log(l + |||) - |Bo| ■ 
Since |B | < 6(B) and \B \ < 5(A) we finally obtain: 



X > LBnlloe 



( 5(A) 6_(B)_\ „ 
^)r . 



> . (24) 



If <5(-B) is zero, the proof simplifies significantly, since now we have to show nonneg- 
ativity of X, with 

X > (5(A) + \A \ + IC 1 ! + 3\C°\)\og(6(A) + \A \ + IC 1 1 + 3|C°|) - 6(A) log 6(A) , 

(25) 

which is trivial, since 6(A) > 1. □ 



The FFT graph can be embedded into a triangulation of a three-dimensional 
cube. An inductive step of embedding of the FFT graph into a triangulation of 
simplices is shown in Figure ^. The simplices can be embedded into a cube, as 
shown in Figure ^|, which then can be partitioned into parallelepipeds. Finally, each 
such parallelepiped can be triangulated. 
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Assuming that we have an embedding of the edges connecting the last two 
layers of F n -\ (butterflies) into the triangulation of a simplex, we embed the but- 
terflies connecting the last two layers of F n into a triangulation of a simplex. We 
map the vertices of the last two layers F n into equidistant points on two crossing 
edges of a simplex, see Figure ^. If we linearly parametrize these crossing edges and 
connect corresponding points by lines, we build a ruled surface. A ruled surface can 
be viewed as a hyperboloid containing the two crossing edges and the connecting 
lines. Each ruled surface separates a simplex built on the appropriate vertices into 
two parts, as shown in Figure |§ the top view is shown in Figure B. 

We map the edges of the butterflies onto lines of one of the ruled surfaces sep- 
arating simplices (to, t§, 67, 64) and [ty, £4, 63, bo) and two other ruled surfaces sep- 
arating simplices (to, £3, 63, bo) and (t?, t^, 64, 67), respectively. The whole simplex 
(to, t7, 67, 60) can be partitioned into the four simplices listed above, and 5 prim- 
itive simplices which will not be further partitioned: (£3, £4, 63, 64), (£3, £4, 64, 67), 
(t3,t4,&3,M, (to,t3,b4,,b 3 ) and (t 7 , t 4 , 64, 63). Each of the simplices (t , h, b 7 , b 4 ), 
(tj, £4, 63, 60)1 (^o, £3, ^0) and (t7, £4, 64, 67) is divided by a ruled surface, hence it 
is sufficient to build a triangulation of a simplex separated by a ruled surface. This 
can be done in 2 steps: 1. partitioning the simplex into prisms by planes parallel to 
the other crossing edges of the simplex (see Figure [l(]), and 2. triangulating each 
of the prisms into primitive simplices. 

Finally, we embed the triangulated simplices into a cube and augment it by a 
triangulation of the space between simplices and the cube, as shown in Figure ||. 

It is easy to verify that the total number of vertices in the triangulation does 
not exceed M = 3n2™, and that the degree of each node does not exceed 16. Hence 
we have constructed a triangulation having the property declared at the beginning 
of this section. 




Figure 7. Recursive con- 
struction of embedding of FFT graph 
into a triangulations of a simplex. 



Figure 8. (Partial) recur- 
sive triangulation of a cube (view ro- 
tated with respect to Fig. [?]). 
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to 




Figure 9. Embedding one Figure 10. Triangulation of 

layer of an FFT graph into a triangula- a simplex separated by a ruled surface 
tion of a simplex, top view. via plane sections parallel to the edges 

(6o,*o) and (b 3 ,t 3 ). 



6 Related Work and Conclusions 

Because of the growing depth of memory hierarchies and the concomitant increase in 
data acess times, the reduction of cache misses in scientific computations remains an 
active subject of research. One of the first lower bounds for data movement between 
primary and secondary storage was obtained in Recent work has focused on 
developing compiler techniques to reduce the number of cache misses. In this direc- 
tion we mention [Jfj , where the notion of cache miss equation (CME) and a tiling of 
structured grids with conflict-free rectilinear parallelepipeds were introduced. Some 
tight lower and upper bounds for computation of local operators of fixed radius on 
structured grids were obtained in 0], where a tiling with a reduced fundamental 
parallelepiped of the interference lattice was used for reduction of cache misses. 
Some practical methods for improving cache performance in computations of local 
operators are given in jlC|j . 

In this paper we showed that the reduction of cache misses for computations of 
local operators of fixed radius, defined on structured or unstructured discretization 
grids, is closely related to the problem of covering these grids with conflict-free sets 
having low surface-to-volume ratio. We introduced two new coverings of structured 
grids: one with Voronoi cells, and one with rectilinear parallelepipeds built on the 
vectors of successive minima of the grid interference lattice. The cells of both cov- 
erings have near- minimal surface-to- volume ratios. Direct measurements of cache 
misses show a significant advantage of the successive minima covering relative to 
computations using the natural loop order, maximally optimized by a compiler. 
We also showed that the computation of local operators of fixed radius on planar 
unstructured grids can be organized in such a way that the number of cache misses 
is asymptotically close to that on structured grids. Finally, we demonstrated how 
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the latter result can be extended to higher dimensions. 

Appendix: The Weight Inequality 

Proposition 16. For any positive v% > • • ■ > Vk > such that i»i < X^i=2 w i? 
following inequality holds: 

k / k \ 1 / d 

i=i \ i=i i 

Proof. The proof uses Jensen inequality, see ||, Ch 2.10, Th. 19: for < r < s 

(E^T<(5>! /S r 

i=l i=l 

and in particular 



4 — 1 2—1 4 — 1 



Let Xi — v\^ d ^ i = 1, . . . , we have to prove 



i=i i=i 

where x\ < (£) i=2 < X),=2 ^ and - ' ' ' - x k > °- 

Let j < k be the minimal index such that > Y^i=j+i x i- Obviously, j > 1, 
so we have: 

(E x ^ d > E ^ + ^i -1 E + • • • + ^i-i 1 E + E x * 

i—l i—1 i—2 i—j + 1 

k 

> E x i + rfa; i H h + ^ _1 Oi+i H 1" X k) 



i=l 

k k 
= 1 i=l 



^ E + dx l + " ' + dxC J-l + + ' ' ' + dx k > 2 E 

i=l i=l 

since d > 2. If no such j exists then the proof becomes even simpler. □ 
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